Put simply, the task of clustering is to place observations that seem similar within the same cluster. Clustering is commonly applied to two-dimensional data, where the goal is to create clusters based on coordinates. Here, we will do something similar: we will cluster houses based on their latitude-longitude locations using several different clustering methods.
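To make "similar" concrete for coordinate data, here is a minimal sketch that assigns each point to the nearest of a few candidate centers. The coordinates and centers below are hypothetical values chosen for illustration, not taken from the dataset, and the helper `sqdist` is a name introduced only for this example.

```julia
# Hypothetical (latitude, longitude) pairs, one point per column.
points = [37.88 37.85 34.05 34.10 32.72;
          -122.23 -122.25 -118.24 -118.30 -117.16]

# Two hypothetical cluster centers, one per column.
centers = [37.8 34.0;
           -122.2 -118.2]

# Squared Euclidean distance between two coordinate vectors.
sqdist(a, b) = sum(abs2, a .- b)

# Assign each point to the index of its nearest center.
assignments = [argmin([sqdist(p, c) for c in eachcol(centers)])
               for p in eachcol(points)]

println(assignments)
```

Points that are close together on the map end up with the same label, which is the basic idea that the clustering methods used in this notebook refine in different ways.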
# Packages we will use throughout this notebook
using Clustering
using VegaLite
using VegaDatasets
using DataFrames
using Statistics
using JSON
using CSV
using Distances
We will start by getting some data: a dataset of 20,000+ California houses. We will then explore whether housing prices correlate directly with map location.
download("https://raw.githubusercontent.com/ageron/handson-ml/master/datasets/housing/housing.csv","newhouses.csv")
houses = CSV.read("newhouses.csv", DataFrame)
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64? | Float64 |
| 1 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 |
| 2 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 |
| 3 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 |
| 5 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 |
| 6 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 |
| 7 | -122.25 | 37.84 | 52.0 | 2535.0 | 489.0 | 1094.0 |
| 8 | -122.25 | 37.84 | 52.0 | 3104.0 | 687.0 | 1157.0 |
| 9 | -122.26 | 37.84 | 42.0 | 2555.0 | 665.0 | 1206.0 |
| 10 | -122.25 | 37.84 | 52.0 | 3549.0 | 707.0 | 1551.0 |
| 11 | -122.26 | 37.85 | 52.0 | 2202.0 | 434.0 | 910.0 |
| 12 | -122.26 | 37.85 | 52.0 | 3503.0 | 752.0 | 1504.0 |
| 13 | -122.26 | 37.85 | 52.0 | 2491.0 | 474.0 | 1098.0 |
| 14 | -122.26 | 37.84 | 52.0 | 696.0 | 191.0 | 345.0 |
| 15 | -122.26 | 37.85 | 52.0 | 2643.0 | 626.0 | 1212.0 |
| 16 | -122.26 | 37.85 | 50.0 | 1120.0 | 283.0 | 697.0 |
| 17 | -122.27 | 37.85 | 52.0 | 1966.0 | 347.0 | 793.0 |
| 18 | -122.27 | 37.85 | 52.0 | 1228.0 | 293.0 | 648.0 |
| 19 | -122.26 | 37.84 | 50.0 | 2239.0 | 455.0 | 990.0 |
| 20 | -122.27 | 37.84 | 52.0 | 1503.0 | 298.0 | 690.0 |
| 21 | -122.27 | 37.85 | 40.0 | 751.0 | 184.0 | 409.0 |
| 22 | -122.27 | 37.85 | 42.0 | 1639.0 | 367.0 | 929.0 |
| 23 | -122.27 | 37.84 | 52.0 | 2436.0 | 541.0 | 1015.0 |
| 24 | -122.27 | 37.84 | 52.0 | 1688.0 | 337.0 | 853.0 |
| 25 | -122.27 | 37.84 | 52.0 | 2224.0 | 437.0 | 1006.0 |
| 26 | -122.28 | 37.85 | 41.0 | 535.0 | 123.0 | 317.0 |
| 27 | -122.28 | 37.85 | 49.0 | 1130.0 | 244.0 | 607.0 |
| 28 | -122.28 | 37.85 | 52.0 | 1898.0 | 421.0 | 1102.0 |
| 29 | -122.28 | 37.84 | 50.0 | 2082.0 | 492.0 | 1131.0 |
| 30 | -122.28 | 37.84 | 52.0 | 729.0 | 160.0 | 395.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
names(houses)
10-element Vector{String}:
"longitude"
"latitude"
"housing_median_age"
"total_rooms"
"total_bedrooms"
"population"
"households"
"median_income"
"median_house_value"
"ocean_proximity"
We will use the VegaLite package here for plotting. This package makes it very easy to plot information on a map. All you need is a JSON file of the map you intend to draw. Here, we will use the California counties JSON file, plot each house on the map, and color each point by its price. The coloring is controlled by the line color="median_house_value:q", which encodes median_house_value as a quantitative field.
cali_shape = JSON.parsefile("data/california-counties.json")
VV = VegaDatasets.VegaJSONDataset(cali_shape,"data/california-counties.json")
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="median_house_value:q"
)
Note that the cell above may take a few minutes to run!
One thing we will explore in this notebook is whether clustering the houses has any direct relationship with their prices. To do so, we will bucket the houses into price intervals of $50,000 and redo the color coding based on each bucket.
bucketprice = Int.(div.(houses[!,:median_house_value],50000))
insertcols!(houses,3,:cprice=>bucketprice)
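As a small worked example of the bucketing arithmetic, the sketch below applies the same `div` expression to a handful of hypothetical prices. These values are illustrative assumptions, not rows read from the dataset.

```julia
# Hypothetical median house values in dollars.
prices = [452600.0, 358500.0, 149999.0, 50000.0, 49999.0]

# div performs truncating division, so each bucket spans $50,000:
# bucket 0 covers [0, 50000), bucket 1 covers [50000, 100000), and so on.
buckets = Int.(div.(prices, 50000))

println(buckets)
```

Each bucket index can then be treated as a category label, which is why the plot encodes it as a nominal field.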
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="cprice:n"
)
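# Keep only the coordinates; these are the features we cluster on
X = houses[!, [:latitude,:longitude]]
# Clustering.jl expects one observation per column, hence the transpose
C = kmeans(Matrix(X)', 10)
insertcols!(houses,3,:cluster10=>C.assignments)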
| | longitude | latitude | cluster10 | cprice | housing_median_age | total_rooms | total_bedrooms |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Int64 | Int64 | Float64 | Float64 | Float64? |
| 1 | -122.23 | 37.88 | 2 | 9 | 41.0 | 880.0 | 129.0 |
| 2 | -122.22 | 37.86 | 2 | 7 | 21.0 | 7099.0 | 1106.0 |
| 3 | -122.24 | 37.85 | 2 | 7 | 52.0 | 1467.0 | 190.0 |
| 4 | -122.25 | 37.85 | 2 | 6 | 52.0 | 1274.0 | 235.0 |
| 5 | -122.25 | 37.85 | 2 | 6 | 52.0 | 1627.0 | 280.0 |
| 6 | -122.25 | 37.85 | 2 | 5 | 52.0 | 919.0 | 213.0 |
| 7 | -122.25 | 37.84 | 2 | 5 | 52.0 | 2535.0 | 489.0 |
| 8 | -122.25 | 37.84 | 2 | 4 | 52.0 | 3104.0 | 687.0 |
| 9 | -122.26 | 37.84 | 2 | 4 | 42.0 | 2555.0 | 665.0 |
| 10 | -122.25 | 37.84 | 2 | 5 | 52.0 | 3549.0 | 707.0 |
| 11 | -122.26 | 37.85 | 2 | 5 | 52.0 | 2202.0 | 434.0 |
| 12 | -122.26 | 37.85 | 2 | 4 | 52.0 | 3503.0 | 752.0 |
| 13 | -122.26 | 37.85 | 2 | 4 | 52.0 | 2491.0 | 474.0 |
| 14 | -122.26 | 37.84 | 2 | 3 | 52.0 | 696.0 | 191.0 |
| 15 | -122.26 | 37.85 | 2 | 3 | 52.0 | 2643.0 | 626.0 |
| 16 | -122.26 | 37.85 | 2 | 2 | 50.0 | 1120.0 | 283.0 |
| 17 | -122.27 | 37.85 | 2 | 3 | 52.0 | 1966.0 | 347.0 |
| 18 | -122.27 | 37.85 | 2 | 3 | 52.0 | 1228.0 | 293.0 |
| 19 | -122.26 | 37.84 | 2 | 3 | 50.0 | 2239.0 | 455.0 |
| 20 | -122.27 | 37.84 | 2 | 3 | 52.0 | 1503.0 | 298.0 |
| 21 | -122.27 | 37.85 | 2 | 2 | 40.0 | 751.0 | 184.0 |
| 22 | -122.27 | 37.85 | 2 | 3 | 42.0 | 1639.0 | 367.0 |
| 23 | -122.27 | 37.84 | 2 | 2 | 52.0 | 2436.0 | 541.0 |
| 24 | -122.27 | 37.84 | 2 | 1 | 52.0 | 1688.0 | 337.0 |
| 25 | -122.27 | 37.84 | 2 | 2 | 52.0 | 2224.0 | 437.0 |
| 26 | -122.28 | 37.85 | 2 | 2 | 41.0 | 535.0 | 123.0 |
| 27 | -122.28 | 37.85 | 2 | 1 | 49.0 | 1130.0 | 244.0 |
| 28 | -122.28 | 37.85 | 2 | 2 | 52.0 | 1898.0 | 421.0 |
| 29 | -122.28 | 37.84 | 2 | 2 | 50.0 | 2082.0 | 492.0 |
| 30 | -122.28 | 37.84 | 2 | 2 | 52.0 | 729.0 | 160.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="cluster10:n"
)
Yes, location affects the price of a house, but here location means things like proximity to water, proximity to downtown, proximity to a bus stop, and so on.
Let's see if this remains true for the other clustering methods.
For k-medoids clustering, we need to build a distance matrix. We will use the Distances package for this purpose and compute the pairwise Euclidean distances.
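As a rough illustration of what a pairwise distance matrix contains, the sketch below builds one by hand for three hypothetical points, without using the Distances package. The names `pts` and `Dmat` are introduced only for this example.

```julia
# Three hypothetical points, one per column.
pts = [0.0 3.0 0.0;
       0.0 0.0 4.0]

n = size(pts, 2)

# Dmat[i, j] holds the Euclidean distance between point i and point j.
Dmat = [sqrt(sum(abs2, pts[:, i] .- pts[:, j])) for i in 1:n, j in 1:n]

println(Dmat)
```

The matrix is symmetric with zeros on the diagonal. It has n × n entries for n points, so for 20,000+ houses it holds over 400 million values and needs several gigabytes of memory.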
xmatrix = Matrix(X)'
D = pairwise(Euclidean(), xmatrix, xmatrix,dims=2)
K = kmedoids(D,10)
insertcols!(houses,3,:medoids_clusters=>K.assignments)
| | longitude | latitude | medoids_clusters | cluster10 | cprice | housing_median_age | total_rooms |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Int64 | Int64 | Int64 | Float64 | Float64 |
| 1 | -122.23 | 37.88 | 9 | 2 | 9 | 41.0 | 880.0 |
| 2 | -122.22 | 37.86 | 9 | 2 | 7 | 21.0 | 7099.0 |
| 3 | -122.24 | 37.85 | 9 | 2 | 7 | 52.0 | 1467.0 |
| 4 | -122.25 | 37.85 | 9 | 2 | 6 | 52.0 | 1274.0 |
| 5 | -122.25 | 37.85 | 9 | 2 | 6 | 52.0 | 1627.0 |
| 6 | -122.25 | 37.85 | 9 | 2 | 5 | 52.0 | 919.0 |
| 7 | -122.25 | 37.84 | 9 | 2 | 5 | 52.0 | 2535.0 |
| 8 | -122.25 | 37.84 | 9 | 2 | 4 | 52.0 | 3104.0 |
| 9 | -122.26 | 37.84 | 9 | 2 | 4 | 42.0 | 2555.0 |
| 10 | -122.25 | 37.84 | 9 | 2 | 5 | 52.0 | 3549.0 |
| 11 | -122.26 | 37.85 | 9 | 2 | 5 | 52.0 | 2202.0 |
| 12 | -122.26 | 37.85 | 9 | 2 | 4 | 52.0 | 3503.0 |
| 13 | -122.26 | 37.85 | 9 | 2 | 4 | 52.0 | 2491.0 |
| 14 | -122.26 | 37.84 | 9 | 2 | 3 | 52.0 | 696.0 |
| 15 | -122.26 | 37.85 | 9 | 2 | 3 | 52.0 | 2643.0 |
| 16 | -122.26 | 37.85 | 9 | 2 | 2 | 50.0 | 1120.0 |
| 17 | -122.27 | 37.85 | 9 | 2 | 3 | 52.0 | 1966.0 |
| 18 | -122.27 | 37.85 | 9 | 2 | 3 | 52.0 | 1228.0 |
| 19 | -122.26 | 37.84 | 9 | 2 | 3 | 50.0 | 2239.0 |
| 20 | -122.27 | 37.84 | 9 | 2 | 3 | 52.0 | 1503.0 |
| 21 | -122.27 | 37.85 | 9 | 2 | 2 | 40.0 | 751.0 |
| 22 | -122.27 | 37.85 | 9 | 2 | 3 | 42.0 | 1639.0 |
| 23 | -122.27 | 37.84 | 9 | 2 | 2 | 52.0 | 2436.0 |
| 24 | -122.27 | 37.84 | 9 | 2 | 1 | 52.0 | 1688.0 |
| 25 | -122.27 | 37.84 | 9 | 2 | 2 | 52.0 | 2224.0 |
| 26 | -122.28 | 37.85 | 9 | 2 | 2 | 41.0 | 535.0 |
| 27 | -122.28 | 37.85 | 9 | 2 | 1 | 49.0 | 1130.0 |
| 28 | -122.28 | 37.85 | 9 | 2 | 2 | 52.0 | 1898.0 |
| 29 | -122.28 | 37.84 | 9 | 2 | 2 | 50.0 | 2082.0 |
| 30 | -122.28 | 37.84 | 9 | 2 | 2 | 52.0 | 729.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="medoids_clusters:n"
)
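# Hierarchical clustering on the same distance matrix D
K = hclust(D)
# Cut the resulting tree so that 10 clusters remain
L = cutree(K;k=10)
insertcols!(houses,3,:hclust_clusters=>L)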
| | longitude | latitude | hclust_clusters | medoids_clusters | cluster10 | cprice | housing_median_age |
|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Int64 | Int64 | Int64 | Int64 | Float64 |
| 1 | -122.23 | 37.88 | 1 | 9 | 2 | 9 | 41.0 |
| 2 | -122.22 | 37.86 | 1 | 9 | 2 | 7 | 21.0 |
| 3 | -122.24 | 37.85 | 1 | 9 | 2 | 7 | 52.0 |
| 4 | -122.25 | 37.85 | 1 | 9 | 2 | 6 | 52.0 |
| 5 | -122.25 | 37.85 | 1 | 9 | 2 | 6 | 52.0 |
| 6 | -122.25 | 37.85 | 1 | 9 | 2 | 5 | 52.0 |
| 7 | -122.25 | 37.84 | 1 | 9 | 2 | 5 | 52.0 |
| 8 | -122.25 | 37.84 | 1 | 9 | 2 | 4 | 52.0 |
| 9 | -122.26 | 37.84 | 1 | 9 | 2 | 4 | 42.0 |
| 10 | -122.25 | 37.84 | 1 | 9 | 2 | 5 | 52.0 |
| 11 | -122.26 | 37.85 | 1 | 9 | 2 | 5 | 52.0 |
| 12 | -122.26 | 37.85 | 1 | 9 | 2 | 4 | 52.0 |
| 13 | -122.26 | 37.85 | 1 | 9 | 2 | 4 | 52.0 |
| 14 | -122.26 | 37.84 | 1 | 9 | 2 | 3 | 52.0 |
| 15 | -122.26 | 37.85 | 1 | 9 | 2 | 3 | 52.0 |
| 16 | -122.26 | 37.85 | 1 | 9 | 2 | 2 | 50.0 |
| 17 | -122.27 | 37.85 | 1 | 9 | 2 | 3 | 52.0 |
| 18 | -122.27 | 37.85 | 1 | 9 | 2 | 3 | 52.0 |
| 19 | -122.26 | 37.84 | 1 | 9 | 2 | 3 | 50.0 |
| 20 | -122.27 | 37.84 | 1 | 9 | 2 | 3 | 52.0 |
| 21 | -122.27 | 37.85 | 1 | 9 | 2 | 2 | 40.0 |
| 22 | -122.27 | 37.85 | 1 | 9 | 2 | 3 | 42.0 |
| 23 | -122.27 | 37.84 | 1 | 9 | 2 | 2 | 52.0 |
| 24 | -122.27 | 37.84 | 1 | 9 | 2 | 1 | 52.0 |
| 25 | -122.27 | 37.84 | 1 | 9 | 2 | 2 | 52.0 |
| 26 | -122.28 | 37.85 | 1 | 9 | 2 | 2 | 41.0 |
| 27 | -122.28 | 37.85 | 1 | 9 | 2 | 1 | 49.0 |
| 28 | -122.28 | 37.85 | 1 | 9 | 2 | 2 | 52.0 |
| 29 | -122.28 | 37.84 | 1 | 9 | 2 | 2 | 50.0 |
| 30 | -122.28 | 37.84 | 1 | 9 | 2 | 2 | 52.0 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="hclust_clusters:n"
)
?dbscan
search: dbscan DbscanResult DbscanCluster
dbscan(D::AbstractMatrix, eps::Real, minpts::Int) -> DbscanResult
Perform DBSCAN algorithm using the distance matrix D.
The following options control which points would be considered density reachable:
eps::Real: the radius of a point neighborhood
minpts::Int: the minimum number of neighboring points (including itself) to qualify a point as a density point.
dbscan(points::AbstractMatrix, radius::Real,
[leafsize], [min_neighbors], [min_cluster_size]) -> Vector{DbscanCluster}
Cluster points using the DBSCAN (density-based spatial clustering of applications with noise) algorithm.
points: the $d×n$ matrix of points. points[:, j] is a $d$-dimensional coordinates of $j$-th point
radius::Real: query radius
Optional keyword arguments to control the algorithm:
leafsize::Int (defaults to 20): the number of points binned in each leaf node in the KDTree
min_neighbors::Int (defaults to 1): the minimum number of a core point neighbors
min_cluster_size::Int (defaults to 1): the minimum number of points in a valid cluster
points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clusters = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
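The help text above describes density in terms of a radius and a minimum neighbor count. The sketch below illustrates only that neighbor-counting step on hypothetical points, written by hand rather than with the package. It is not the full DBSCAN algorithm, and `radius`, `minpts`, and `dist` are names assumed for this example.

```julia
# Hypothetical 2-D points, one per column: a tight group of four and one outlier.
pts = [0.0 0.1 0.0 0.1 5.0;
       0.0 0.0 0.1 0.1 5.0]

radius = 0.2   # neighborhood radius (hypothetical value)
minpts = 3     # minimum number of neighbors, counting the point itself

n = size(pts, 2)

# Euclidean distance between points i and j.
dist(i, j) = sqrt(sum(abs2, pts[:, i] .- pts[:, j]))

# For every point, count how many points (including itself) lie within the radius.
neighbor_counts = [count(j -> dist(i, j) <= radius, 1:n) for i in 1:n]

# A point counts as a density (core) point when it has at least minpts neighbors.
is_core = neighbor_counts .>= minpts

println(neighbor_counts)
println(is_core)
```

The four grouped points qualify as density points, while the isolated point does not. Unlike the earlier methods, the number of clusters is not chosen up front; it follows from the radius and the neighbor threshold.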
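using Distances
# Pairwise squared Euclidean distances between all houses
dclara = pairwise(SqEuclidean(), Matrix(X)',dims=2)
# The distances are squared, so the radius 0.05 is compared against squared distances
L = dbscan(dclara, 0.05, 10)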
@show length(unique(L.assignments))
length(unique(L.assignments)) = 15
15
insertcols!(houses,3,:dbscanclusters3=>L.assignments)
| | longitude | latitude | dbscanclusters3 | hclust_clusters | medoids_clusters | cluster10 |
|---|---|---|---|---|---|---|
| | Float64 | Float64 | Int64 | Int64 | Int64 | Int64 |
| 1 | -122.23 | 37.88 | 1 | 1 | 9 | 2 |
| 2 | -122.22 | 37.86 | 1 | 1 | 9 | 2 |
| 3 | -122.24 | 37.85 | 1 | 1 | 9 | 2 |
| 4 | -122.25 | 37.85 | 1 | 1 | 9 | 2 |
| 5 | -122.25 | 37.85 | 1 | 1 | 9 | 2 |
| 6 | -122.25 | 37.85 | 1 | 1 | 9 | 2 |
| 7 | -122.25 | 37.84 | 1 | 1 | 9 | 2 |
| 8 | -122.25 | 37.84 | 1 | 1 | 9 | 2 |
| 9 | -122.26 | 37.84 | 1 | 1 | 9 | 2 |
| 10 | -122.25 | 37.84 | 1 | 1 | 9 | 2 |
| 11 | -122.26 | 37.85 | 1 | 1 | 9 | 2 |
| 12 | -122.26 | 37.85 | 1 | 1 | 9 | 2 |
| 13 | -122.26 | 37.85 | 1 | 1 | 9 | 2 |
| 14 | -122.26 | 37.84 | 1 | 1 | 9 | 2 |
| 15 | -122.26 | 37.85 | 1 | 1 | 9 | 2 |
| 16 | -122.26 | 37.85 | 1 | 1 | 9 | 2 |
| 17 | -122.27 | 37.85 | 1 | 1 | 9 | 2 |
| 18 | -122.27 | 37.85 | 1 | 1 | 9 | 2 |
| 19 | -122.26 | 37.84 | 1 | 1 | 9 | 2 |
| 20 | -122.27 | 37.84 | 1 | 1 | 9 | 2 |
| 21 | -122.27 | 37.85 | 1 | 1 | 9 | 2 |
| 22 | -122.27 | 37.85 | 1 | 1 | 9 | 2 |
| 23 | -122.27 | 37.84 | 1 | 1 | 9 | 2 |
| 24 | -122.27 | 37.84 | 1 | 1 | 9 | 2 |
| 25 | -122.27 | 37.84 | 1 | 1 | 9 | 2 |
| 26 | -122.28 | 37.85 | 1 | 1 | 9 | 2 |
| 27 | -122.28 | 37.85 | 1 | 1 | 9 | 2 |
| 28 | -122.28 | 37.85 | 1 | 1 | 9 | 2 |
| 29 | -122.28 | 37.84 | 1 | 1 | 9 | 2 |
| 30 | -122.28 | 37.84 | 1 | 1 | 9 | 2 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
@vlplot(width=500, height=300) +
@vlplot(
mark={
:geoshape,
fill=:black,
stroke=:white
},
data={
values=VV,
format={
type=:topojson,
feature=:cb_2015_california_county_20m
}
},
projection={type=:albersUsa},
)+
@vlplot(
:circle,
data=houses,
projection={type=:albersUsa},
longitude="longitude:q",
latitude="latitude:q",
size={value=12},
color="dbscanclusters3:n"
)
After finishing this notebook, you should be able to:
Prices in California do not seem to map exactly onto geographical location. Specifically, running clustering algorithms on the houses dataset did not reveal a mapping to the price ranges. This indicates that the relationship between price and geographical location is not necessarily based on neighborhood but probably on other factors, such as closeness to the water or to a downtown.
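One way to put a number on this visual impression is to measure how well cluster labels line up with price buckets. The sketch below computes a simple purity score on hypothetical labels. The `purity` function and the sample vectors are assumptions made for illustration; they are not part of the notebook's analysis or of the Clustering package.

```julia
# Hypothetical cluster labels and price buckets for eight houses.
clusters = [1, 1, 1, 1, 2, 2, 2, 2]
buckets  = [3, 3, 5, 7, 2, 2, 2, 9]

# Purity: the fraction of houses whose price bucket is the most common
# bucket within their own cluster.
function purity(clusters, buckets)
    matched = 0
    for c in unique(clusters)
        members = buckets[clusters .== c]
        # Size of the most frequent bucket inside this cluster.
        best = maximum(count(==(b), members) for b in unique(members))
        matched += best
    end
    return matched / length(clusters)
end

p = purity(clusters, buckets)
println(p)
```

A score of 1.0 would mean every cluster contains a single price bucket. On the real data, one could pass a cluster column such as cluster10 together with cprice and compare the score against the share of the largest bucket overall.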